Unsupervised Clustering of Text Entities in Heterogeneous Grey Level Documents

Authors

  • Stéphane Bres
  • Véronique Eglin
  • Antoine Gagneux
Abstract

This paper presents a new method for the functional classification of text blocks in a document. It is based on texture analysis and unsupervised classification. Texture is used to define different classes of text blocks in the document and to suggest an order of exploration, from the most eye-catching data to the least significant text block. The typographical properties of blocks are characterized by two main discriminating primitives: the complexity of the text drawing and the structural relief of the block. This analysis is the starting point of a three-class categorization into functional families (main headings, sub-headings and text paragraphs). Each block of text is described and classified through a labeling process based on a 3D feature space that uses the two previous features (complexity and structural relief) and a third one chosen among pattern primitives, block size and location in the document. This method provides a first approach to a global, context-free classification of documents.
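
The abstract does not name the unsupervised classifier, so the following is only a minimal sketch of the general idea: grouping text blocks into three classes in a 3D feature space. It assumes plain k-means as a stand-in algorithm, and the function name `kmeans`, the feature scaling to [0, 1] and all feature values are hypothetical, chosen for illustration rather than taken from the paper.

```python
from math import dist

def kmeans(points, k=3, iterations=20):
    """Plain k-means (an assumed stand-in); returns one cluster label per point."""
    # Deterministic farthest-point initialisation: start from the first point,
    # then repeatedly add the point farthest from the centres chosen so far.
    centres = [points[0]]
    while len(centres) < k:
        centres.append(max(points, key=lambda p: min(dist(p, c) for c in centres)))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each block to its nearest centre.
        labels = [min(range(k), key=lambda j: dist(p, centres[j])) for p in points]
        # Update step: move each centre to the mean of its current members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centres[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels

# Hypothetical feature vectors, one per text block:
# (complexity, structural relief, relative block size), each scaled to [0, 1].
# The numbers are made up for illustration only.
blocks = [
    (0.90, 0.85, 0.10),
    (0.88, 0.90, 0.12),
    (0.55, 0.50, 0.20),
    (0.50, 0.55, 0.22),
    (0.15, 0.20, 0.80),
    (0.20, 0.15, 0.85),
    (0.18, 0.22, 0.78),
]

labels = kmeans(blocks)
for features, label in zip(blocks, labels):
    print(label, features)
```

The clusters come out as anonymous integer labels. Mapping them to the functional families named in the abstract (main headings, sub-headings, text paragraphs) would be a separate step that this sketch does not attempt.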

Similar resources

Document Clustering Based on Ontology and a Fuzzy Approach

Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from a large amount of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering, the unsupervised classification of similar documents into different groups, is one of the important techniques of text mining. The most important step...

PAYMA: A Tagged Corpus of Persian Named Entities

The goal in the named entity recognition task is to classify proper nouns of a piece of text into classes such as person, location, and organization. Named entity recognition is an important preprocessing step in many natural language processing tasks such as question-answering and summarization. Although many research studies have been conducted in this area in English and the state-of-the-art...

An automatic approach for ontology-based feature extraction from heterogeneous textual resources

Data mining algorithms such as data classification or clustering methods exploit features of entities to characterise, group or classify them according to their resemblance. In the past, many feature extraction methods focused on the analysis of numerical or categorical properties. In recent years, motivated by the success of the Information Society and the WWW, which has made available enormou...

Topic Oriented Semi-supervised Document Clustering

In our study on developing a text mining prototype system, documents need to be grouped according to the user's need. However, traditional document clustering is usually treated as unsupervised learning, which cannot effectively group documents according to the user's need. To solve this problem, we propose a new document clustering approach. The main contributions include: (1) Describes user’s need b...

Heterogeneous Transfer Learning for Image Clustering via the Social Web

In this paper, we present a new learning scenario, heterogeneous transfer learning, which improves learning performance when the data can be in different feature spaces and where no correspondence between data instances in these spaces is provided. In the past, we have classified Chinese text documents using English training data under the heterogeneous transfer learning framework. In this pape...

Publication date: 2002